House Price Prediction

This project was to develop a Machine Learning model for predicting house prices. Although a number of tree-based algorithms are relevant to this application, the project set out to examine linear regression and focused specifically on four models: Linear Regression, Ridge Regression, Lasso Regression, and Elastic Net.

Overview


In this article, “variable” as a general programming term and “feature” denoting a predictor employed in a Machine Learning model are used interchangeably. The following outlines my approach and highlights the logical steps I followed in developing a Machine Learning model. The development process was highly iterative, and the steps are not necessarily presented in the exact order they occurred. Nevertheless, these steps correctly depict the thought process and overall strategies for developing a Machine Learning model.

Data Analysis


Kaggle House Price Dataset

I downloaded and imported the train data set. Here is some information obtained by examining its structure and summary.

## [1] "Imported train data set:  1460  obs. of  81 variables"
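For reference, below is a minimal sketch of what the import and inspection step might look like in R; the file name and object name are assumptions, not necessarily those used in the project.

```r
# Read the Kaggle training data (file name assumed to be train.csv).
train <- read.csv("train.csv", stringsAsFactors = FALSE)
print(paste("Imported train data set:", nrow(train), "obs. of", ncol(train), "variables"))
str(train)      # structure of each variable
summary(train)  # summary statistics per variable
```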

Missingness

Next, I examined the distribution of missingness and the percentage of missing values. A few variables had most of their observations missing, which made them unusable; they were consequently removed. Here is a visualization of the missingness in the train data set.
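Below is a hedged sketch of one way such a missingness plot can be produced; the VIM package is an assumption, as the original visualization may have been generated with a different tool.

```r
# Visualize the proportion of missing values and the missingness pattern.
library(VIM)
aggr(train, numbers = TRUE, sortVars = TRUE, cex.axis = 0.5,
     ylab = c("Proportion missing", "Missingness pattern"))
```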

Percentage of Missing Values

Further examination of the percentage of missing values of each variable revealed:
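A simple sketch of how the per-variable percentage of missing values can be computed (the object name train is carried over from the import sketch above):

```r
# Percentage of missing values per variable, highest first.
missing_pct <- sort(colMeans(is.na(train)) * 100, decreasing = TRUE)
round(missing_pct[missing_pct > 0], 2)
```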

Feature Selection


At this point, I removed a set of variables based on:

Boruta

After converting all variables in the train dataset to integer or numeric fields and programmatically imputing the missing values, I ran Boruta for an initial analysis of variable importance. In this context it took about 40 minutes to iterate 500 times and produced results like the following, where features in green were of confirmed importance, those in red were rejected, i.e. not important, and those in yellow remained tentative, not yet resolved before reaching the set number of iterations.
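The following is a hedged sketch of the imputation and Boruta steps described above; train_numeric stands in for the dataset after all variables were converted to integer or numeric fields, and the mice settings, seed, and maxRuns value are assumptions based on the text (500 iterations, roughly 40 minutes of run time).

```r
library(mice)
library(Boruta)

# Programmatically impute missing values (settings are illustrative).
imp <- mice(train_numeric, m = 1, method = "pmm", seed = 123, printFlag = FALSE)
train_imputed <- mice::complete(imp, 1)

# Initial variable-importance analysis with Boruta, up to 500 iterations.
set.seed(123)
boruta_out <- Boruta(SalePrice ~ ., data = train_imputed, maxRuns = 500, doTrace = 1)
plot(boruta_out, las = 2, cex.axis = 0.5)  # green = confirmed, red = rejected, yellow = tentative
```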

I stored the list of features confirmed by Boruta and subsequently removed from the train dataset the features not included in this list.
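A sketch of that step, continuing from the Boruta object in the sketch above (object names are illustrative):

```r
# Keep only the Boruta-confirmed features plus the label.
confirmed <- getSelectedAttributes(boruta_out, withTentative = FALSE)
train_selected <- train_imputed[, c(confirmed, "SalePrice")]
```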

Features with Insignificant P-Values

While iteratively developing, fitting, and tuning the model, I documented a list of features with consistently insignificant p-values, i.e. greater than 0.05, in test runs. Below is a snapshot of these features to be removed from the train dataset prior to executing a test run. Notice these features were not a unique set; different development paths and configurations could, and would, result in a different set of features.
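As a simplified sketch of how such a list can be derived from a single fit (the actual list in the project was curated across many runs, so this is only illustrative):

```r
# Flag coefficients whose p-values exceed 0.05 in one fitted model.
fit <- lm(SalePrice ~ ., data = train_selected)
coefs <- summary(fit)$coefficients
insignificant <- rownames(coefs)[coefs[, "Pr(>|t|)"] > 0.05]
insignificant

# Drop those features before the next test run.
train_selected <- train_selected[, setdiff(names(train_selected), insignificant)]
```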

Character Variables

Factor variables were read in as character ones. Rather than converting them into factor variables, I converted them into integer or numeric fields so that data could later be imputed and feature importance derived programmatically.

The above, for example, shows that the variable BldgType was a character variable with five unique levels. It was converted into an ordinal one with values between 0 and 1. Notice that the process was iterative during data preparation and feature engineering. Both converting and combining variables were considered. Domain knowledge, subjectivity, and common sense were all relevant to deciding what to convert and how, as applicable. The techniques and strategies can and will vary from person to person and model to model.
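A hedged sketch of such a conversion; the level names and the chosen weights below are illustrative assumptions, with the actual ordering reflecting the domain judgment described above.

```r
# Map the five BldgType levels onto an ordinal 0-1 scale.
bldg_map <- c("2fmCon" = 0.2, "Duplex" = 0.4, "Twnhs" = 0.6,
              "TwnhsE" = 0.8, "1Fam" = 1.0)
train$BldgType <- unname(bldg_map[as.character(train$BldgType)])
```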

Numeric Variables

For numeric variables, their raw values can produce unintended effects. For instance, assume a house price is modeled as having a linear relationship with the month the house is sold. In such a case, a generalization is essentially built into the model: a house sold in December, with a value of 12, would contribute 12 times more to the response variable than one sold in January, with a value of 1. This configuration does not correctly reflect the seasonality, nor the degree of impact the sale month has on a house price.

One alternative way of modeling seasonality, as shown above, is to convert the variable values to a scale between 0 and 1, giving the summer months, i.e. July to September, the most weight in contributing to the market house price, the response variable, and the winter months the least weight to signify the slow period.
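A sketch of such a rescaling of the month-sold variable (MoSold, 1 to 12); the exact weights below are illustrative assumptions, not the values used in the project.

```r
# Seasonal weights peaking in July-September and lowest in winter.
season_weight <- c(0.10, 0.10, 0.30, 0.50, 0.70, 0.80,  # Jan-Jun
                   1.00, 1.00, 1.00, 0.60, 0.30, 0.10)  # Jul-Dec
train$MoSold <- season_weight[train$MoSold]
```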

Later, this feature was removed from the final model due to insignificance consistently indicated by p-values in a series of test runs. Still, it was necessary to make the effort to prepare the data and convert this variable from a 1-to-12, January-to-December scale to a more meaningful and realistic one for describing real-world scenarios. With a proper scale for this and other similar variables, packages like mice could calculate meaningful values for imputation, and Boruta could derive feature importance.

Above all, the strategies for determining what and how to convert a variable have much to do with an examiner's domain knowledge, subjectivity, and common sense, in addition to reviewing the composition and distribution of the data.

And the values of a variable sometimes do not tell the whole story. It may not be the values of a variable, but the variance of those values, that plays the more influential role in making predictions.

Data Visualization


Up to this point, I had an initial set of features to start developing a model with. Throughout the development, I made changes to the feature set and to the observations based on diagnostics of the test results. The following series of plots was generated along the development process.

Along the way, I produced multiple versions and configurations of the following plots. The set presented here is just one of many.

Prepared Train Dataset

Here’s a snapshot of the prepared data set ready for Machine Learning development.

Distribution of the Label

The label, i.e. the response variable, was SalePrice. Here it is plotted without a logarithm transformation.
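A minimal sketch of how such a plot can be drawn:

```r
# Distribution of the untransformed response variable.
hist(train$SalePrice, breaks = 50,
     main = "Distribution of SalePrice", xlab = "SalePrice")
```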

Feature vs. Label

To examine how each feature relates to the label, SalePrice, I plotted each pair individually. The linearity among variables was apparent.
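A sketch of one such feature-vs.-label plot; GrLivArea is used here purely as an illustrative feature.

```r
# Scatter plot of one feature against the label, with a fitted line.
plot(train$GrLivArea, train$SalePrice,
     xlab = "GrLivArea", ylab = "SalePrice", main = "GrLivArea vs. SalePrice")
abline(lm(SalePrice ~ GrLivArea, data = train), col = "red")
```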

Pairs Panel

Here’s a panel plot with all features and the label.
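A hedged sketch of one way to produce such a panel; the psych package is an assumption about how the plot was generated.

```r
# Pairs panel of the prepared features and the label.
library(psych)
pairs.panels(train_selected)
```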

Correlation Matrix

The correlation matrix, the feature-vs.-label plots, and the pairs panel were the three main references for developing an initial model.
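A sketch of how the correlation matrix can be computed and visualized; corrplot is an assumed choice of plotting package.

```r
# Correlation matrix of the prepared (numeric) variables.
library(corrplot)
corr_mat <- cor(train_selected, use = "pairwise.complete.obs")
corrplot(corr_mat, method = "circle", tl.cex = 0.6)
```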

Partitioning Data

I partitioned the train dataset into a 70-30 split, with 70% for training and 30% for testing. Below is a set of plots produced by fitting the four regression models: Linear, Ridge, Lasso, and Elastic Net.
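A hedged sketch of the partitioning step; caret's createDataPartition and the seed are assumptions.

```r
library(caret)
set.seed(123)
# 70% of rows for training, the remaining 30% for testing.
idx <- createDataPartition(train_selected$SalePrice, p = 0.7, list = FALSE)
training <- train_selected[idx, ]
testing  <- train_selected[-idx, ]
```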

1. Linear Model


Here’s a summary of lm for one of the runs. The adjusted R-squared was 0.9067 with insignificant features removed.
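A minimal sketch of that fit, with object names carried over from the partitioning sketch above:

```r
# Fit the linear model on the training split and inspect the summary,
# which reports the adjusted R-squared.
fit_lm <- lm(SalePrice ~ ., data = training)
summary(fit_lm)
```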

1.1 Diagnostic Plots

The diagnostic plots played an important role in the initial development. Many of the changes and adjustments made were based on examining and interpreting these plots. In each iteration, I reviewed the plots, changed the composition of features and interactions, removed outliers or added observations back, etc., followed by more test runs. The process was highly iterative, and productivity relied heavily on good documentation to facilitate the analysis and to restore a configuration when needed.
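The four standard diagnostic plots can be reproduced directly from the fitted model:

```r
# Residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage.
par(mfrow = c(2, 2))
plot(fit_lm)
par(mfrow = c(1, 1))
```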

1.2 Variable Importance

1.3 Distribution of Residuals

1.4 Predicted vs. Observed

2. Ridge Regression

Set alpha=0 and a sequence for tuning lambda. I started from a wide range, like 0.001 to 100, and gradually reduced the range to find a good window. The size of a step sometimes had a noticeable effect on the outcome. Much experimentation and repetition happened here.
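A hedged sketch of that tuning setup using caret with glmnet; the grid below reflects the wide starting range mentioned above and is an assumption.

```r
library(caret)
library(glmnet)

# Ridge: alpha fixed at 0, lambda scanned over a wide sequence.
ridge_grid <- expand.grid(alpha = 0, lambda = seq(0.001, 100, length.out = 100))
ridge_fit <- train(SalePrice ~ ., data = training, method = "glmnet",
                   trControl = trainControl(method = "cv", number = 10),
                   tuneGrid = ridge_grid)
ridge_fit$bestTune
```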

2.1 Regularization

2.2 Variable Importance

2.3 Distribution of Residuals

2.4 Predicted vs. Observed

3. Lasso Regression

Set alpha=1 and a sequence for tuning lambda. As with Ridge Regression, I started from a wide range and gradually narrowed it to identify a good range and step size to scan.
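The lasso counterpart, as a sketch under the same assumptions (grid values are illustrative):

```r
# Lasso: alpha fixed at 1, lambda scanned over its own sequence.
lasso_grid <- expand.grid(alpha = 1, lambda = seq(0.0001, 1, length.out = 100))
lasso_fit <- train(SalePrice ~ ., data = training, method = "glmnet",
                   trControl = trainControl(method = "cv", number = 10),
                   tuneGrid = lasso_grid)
lasso_fit$bestTune
```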

3.1 Regularization

3.2 Variable Importance

3.3 Distribution of Residuals

3.4 Predicted vs. Observed

4. Elastic Net

Initially, I set one sequence for tuning both alpha and lambda. This turned out not to be productive. Since in a given configuration the two values can be far apart from each other, the range to scan would become relatively extensive, with a small step sometimes necessary just to initially locate the values. A few times my laptop ran out of resources and simply stopped responding later in a run.

Setting an individual sequence for alpha and lambda was a more productive approach for me. Nevertheless, with the increased number of combinations and 10-fold cross-validation, it took longer and a few iterations to narrow the ranges and locate the best pair of alpha and lambda.
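A hedged sketch of that setup, with separate (illustrative) sequences for alpha and lambda and 10-fold cross-validation:

```r
# Elastic net: scan alpha and lambda independently.
enet_grid <- expand.grid(alpha = seq(0.1, 0.9, by = 0.1),
                         lambda = seq(0.0001, 1, length.out = 50))
enet_fit <- train(SalePrice ~ ., data = training, method = "glmnet",
                  trControl = trainControl(method = "cv", number = 10),
                  tuneGrid = enet_grid)
enet_fit$bestTune
```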

4.1 Regularization

4.2 Variable Importance

4.3 Distribution of Residuals

4.4 Predicted vs. Observed

Model Comparison


Boxplots
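A hedged sketch of how such comparison boxplots can be produced with caret, assuming the models above were trained with the same resampling setup:

```r
# Collect the cross-validation results and compare RMSE across models.
results <- resamples(list(Ridge = ridge_fit, Lasso = lasso_fit, ElasticNet = enet_fit))
bwplot(results, metric = "RMSE")
```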

3D Scatter Plots






Now the Fun Just Got Started

At this time, a baseline model was in place, and there was certainly much room for improvement. Still, the fun had just got started. Using test.csv, the test data set provided by Kaggle, I started fine-tuning and improving the model and making submissions.

Closing Thoughts


Considering this was multiple linear regression, I was surprised how good the scores were, based on the few submissions I had made. Linear regression is conceptually simple and relevant to many activities in our daily life. We all do linear regression in our minds when making a purchase. Is this expensive or cheap? Every time we ponder that thought, we are doing linear regression in some shape or form.

We must, however, not mistakenly and carelessly assume linear regression is as simple as it appears, as I have learned from my own mistakes. There is much to investigate and learn from linear regression. Ordinary Least Squares (OLS), which linear regression is built upon, is too fundamental to overlook. The simplicity of OLS offers a clear strategy and enables Machine Learning algorithms to describe the combined effects of a set of predictors based on distance. The concept of residuals is simple, the approach straightforward, and the objective clear. Ultimately, we want to minimize the distance between what is observed and what is predicted. This distance is our cost, or error, function.
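Concretely, OLS chooses the coefficients that minimize that distance as the sum of squared residuals, where $y_i$ is the observed price and $x_i$ the feature vector of house $i$:

$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - x_i^{\top} \beta \right)^2$$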

There are a few options for continuing the development. Tree-based models, ensemble learning, further refining and optimizing the data, more feature engineering, etc. are all applicable. With this many variables, a tree-based model should have a good story to tell, which is what I plan to try next.

Read more of Yung’s articles.